This workflow is designed to output gene expression counts from STAR
aligner using --quantmode.
It will also perform general QC statistics on the fastqs with fastqc
and the alignment using rseqc. Finally, the QC reports
are collected into a single file using multiQC.
A DAG (directed acyclic graph) of the workflow is show below:
First, fork the repository from Children’s bitbucket. Do this by clicking the “create fork” symbol from the bitbucket web interface and fork it to your personal bitbucket account, as illustrated below.
Next, you will need to clone your personal repository to your home in Cybertron. See the image below for where you can find the correct URL on your forked bitbucket repo.
Copy that URL to replace
https://childrens-atlassian/bitbucket/scm/~jsmi26/rnaseq_count_nf.git
below.
#on a terminal on the Cybertron login nodes
cd ~
# your fork should have your own userID (rather than jsmi26)
git clone https://childrens-atlassian/bitbucket/scm/~jsmi26/rnaseq_count_nf.gitFinally, grab a compute node and activate the conda environment. It
is also be best practice to use tmux or screen
to ensure that if at the session is disconnected, then you’re nextflow
workflow (if running) won’t end with SIGKILL error.
#Find your project code by listing all your projects on the Cybertron terminal
project info
#Grab a compute note
qsub -I -q freeq -l select=1:ncpus=1:mem=8g -l walltime=8:00:00 -P [PROJECT CODE]
cd /path/to/cloned/rnaseq_count_nfIf you don’t have conda installed yet, please follow these directions. You may stop following the directions after the conda deactivate step.
Next, for the conda environment to be solved, you will need to set channel_priority to flexible in your conda configs as well. To read more about conda environments and thier configurations, check out the documentation.
conda config --describe channel_priority # print your current conda settings
conda config --set channel_priority flexible # set to flexible if not already done#Create the environement only once. Skip this step if you've already created the environment
conda env create -f env/nextflow.yaml
#Activate the conda environment.
conda activate nextflowEdit the nextflow.config file in any text editor; the
example below is in R. You will need to change the project code (use the
same one as you used above), and the queue name to be paidq or a tier 3
queue. Paidq will cost less than $0.01 for testing with the workflow’s
example data provided in the directory test_data.
usethis::edit_file("../nextflow.config")//global parameters
params {
// general options
sample_sheet = "test_data/paired_end_sample_sheet.csv"
download_sra_fqs = false
queue = 'paidq'
project = '[PROJECT CODE]'
<...continues...>
Determine if the workflow works on your installation of the conda environment by running the following command.
./main_run.sh "paired_end_test"To test the single-end sheet, modify the sample_sheet parameter in
the nextflow.config and the output directory
(outdir).
params {
// general options
sample_sheet = "test_data/single_end_sheet.csv"
[...]
outdir = "./single_end_results/"
<...continues...>
}
then run the command
./main_run.sh "single_end_test"To test the sra sample sheet - modify these the sample sheet, and set
download_sra_fastqs to true in the
nextflow.config and the output directory
(outdir).
params {
// general options
sample_sheet = "test_data/sra_sample_sheet.csv"
download_sra_fqs = true
[...]
outdir = "./sra_results/"
<...continues...>
}
then run on the command:
./main_run.sh "sra_test"A comma delimited (csv) sample sheet is required for the input samples to be processed. Please note, do not remove the comment lines that begin with “#” in the example files. The same number of comment lines must be included in any input sample sheet, based on the examples provided here.
It must have the column names (in any order):
r1 - the filepath for the read 1 fastq in paired-end RNA-seq, or the single-end fastq file
r2 - the filepath for the read 2 fastq in paired-end RNA-seq
id - unique sample ID, no duplicates allowed in the sample sheet
single_end - boolean [true/false] if the data is single-end or paired-end
TO DO: A function provided here, but it may not meet the needs of every experiment.
The two examples are provided here to look at:
example_sheet <- read.csv(here::here("test_data/paired_end_sample_sheet.csv"),
header = TRUE, comment.char = "#")
example_sheetIf downloading the fastq files directly from the SRA, the sample
sheet only requires the id and the single_end
columns.
sra_example <- read.csv(here::here("test_data/sra_sample_sheet.csv"),
header = TRUE, comment.char = "#")
sra_exampleEdit the nextflow.config file to include the appropriate
filepaths for the samples to be included in the pipeline, and the
appropriate genome references. The required files are listed here:
## //working directory for temporary/intermediate files produced in the workflow processes
## workDir = "$HOME/temp"
##
## //global parameters
## params {
## // general options
## sample_sheet = "test_data/paired_end_sample_sheet.csv"
## download_sra_fqs = false
## queue = 'paidq'
## project = '207f23bf-acb6-4835-8bfe-142436acb58c'
## outdir = "./paired_end_results/"
## publish_dir_mode = 'copy'
##
## //star specific params
## // Must be full filepaths for files outside the projectDir
## index = '/gpfs/shared_data/STAR/human_v38/STAR_2.7'
## build_index = false
## gtf = '/gpfs/shared_data/STAR/human_v38/STAR_2.7/genes.gtf'
## fasta = '/gpfs/shared_data/STAR/human_v38/STAR_2.7/genome.fa'
## star_ignore_sjdbgtf = false
## <...>
The gene_list and ref_gene_model can be
found at the RSEQC documentation
page and are located at /gpfs/shared_data to share with
the SCRI.
However, the ref_gene_model must be in BED12 format and must match the transcript IDs used in the GTF for STAR aligner. The BED12 can be produced from the GFT file using UCSC utilities (Kent tools).
In the nextflow.config, you can define additional
command line arguments to the scientific software under process
scope.
## // Computational resource allocation for the processes run in the workflow
## process {
## publishDir = [
## path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
## mode: params.publish_dir_mode,
## saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
## ]
## errorStrategy = "retry"
## maxRetries = 2
##
## //STAR-aligner process specific parameters
## //https://www.nextflow.io/docs/latest/process.html#dynamic-computing-resources
## withName: STAR_ALIGN {
## cpus = { 4 * task.attempt }
## memory = { 32.GB * task.attempt }
## <...>
You may use the advanced options to change computational resources
requested for different processes. The CPUs and memory parameters can
updated to request a larger amount of resources like CPUs or memory if
files are large. You may also edit the commandline parameters for
processes in the workflow using the ext.args directive.
The current TRIMGALORE paramters look like this:
## //Trimgalore process specific parameters
## withName: TRIMGALORE {
## cpus = { 2 * task.attempt }
## memory = { 8.GB * task.attempt }
## ext.args = ''
## }
##
## <...>
But you’d like to request 16Gb of memory for and gzip the output.
## //Trimgalore process specific parameters
## withName: TRIMGALORE {
## cpus = { 2 * task.attempt }
## memory = { 16.GB * task.attempt }
## ext.args = '--gzip'
## }
##
## <...>
Then execute a wrapper around the nextflow run main.nf
command which is in the main_run.sh shell script. Provide a
descriptive name (string) for your workflow run, in this example we will
use “my_analysis”.
Typically, you will not need to change the main_run.sh
often.
The main_run.sh script defines the profiles for
different executors in the variable NFX_PROFILE. The
choices for profiles are:
“PBS_singularity” is recommended. This profiles executes the jobs on the HPC using the PBS scheduler and then will run the job inside singularity containers with the appropriate scientific software versions.
“local_singularity” is good for workflow development if you’re making a lot of changes. This will use singularity on Cybertron, but run the jobs on the interactive compute node that you’ve requested during “Set-up Nextflow Environment” steps above.
“PBS_conda” is generally not recommended.This profiles executes the jobs on the HPC using the PBS scheduler. Then nextflow will coordinate the creation of conda environments for each step in the workflow and active those environments in the scheduled compute nodes for job.
## #!/bin/bash
##
## set -eu
## DATE=$(date +%F)
## NFX_CONFIG=./nextflow.config
## #Options: local_singularity, PBS_singularity, and PBS_conda
## NFX_PROFILE='PBS_singularity'
## #Options: star_index or rnaseq_count or sra_fastqs
## NFX_ENTRY='rnaseq_count'
## #The output prefix on filenames for reports/logs
## <...>
You can also change the entry_point
for the workflow.
NFX_ENTRY='star_index'.NFX_ENTRY='sra_fastqs'.NFX_ENTRY='rnaseq_count' to run the
complete pipeline../main_run.sh "my_analysis"Under the path provided in the nextflow config for params “outdir”, you will find directories named for each of the modules. Lets say “params.outdir = ./results”. There will be the following file structure:
results/
In addition, there will be an HTML report with information on where
the temp data is stored in the workDir path, and general
run statistics such as resource utilized versus requested, which helps
with optimization. It will also provide information on how much walltime
was used per sample, total CPU hours, etc.
The HTML file is found in reports directory and will
have the prefix defined on the command line when the
./main_run.sh "my_analysis" was invoked, so in this example
it would be named “my_analysis_{DATE}.html”.
There will also be a detailed nextflow log file that is useful for de-bugging which will also be named in this example, “my_analysis_{DATE}_nextflow.log”.
Finally, the pipeline will produce a DAG - Directed acyclic graph
which describes the workflow channels (inputs) and the modules. The DAG
image will be saved under dag/ directory with the name
“my_analysis_{DATE}_dag.pdf”.
Nextflow has an utility to clean up old work directories and logs that are no longer needed. This can be run after any amount of time to keep your workdir from getting too large or if you’re running out of disk space.
This requires the session ID or session name, which can found in the
.nextflow/history file.
nextflow log
nextflow clean -f [RUN NAME]sessionInfo()